---
title: 'Predicting Belgian Real Estate Prices: Part 1: Feature Selection for Web Scraping'
author: Adam Cseresznye
date: '2023-11-03'
categories:
- Predicting Belgian Real Estate Prices
jupyter: python3
toc: true
format:
html:
code-fold: true
code-tools: true
---
![Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}
Welcome to our project focusing on understanding the key factors that impact real estate property prices in Belgium. Our ultimate goal is to leverage data collected from _immoweb.be_, a prominent real estate platform in the country, to predict house prices in Belgium.
::: {.callout-note}
You can access the project's app through its [Streamlit website](https://belgian-house-price-predictor.streamlit.app/).
:::
The app is divided into three sections:
1. **Intro**: This section provides a basic overview of how the project is structured, how data is handled, and how the models are trained.
2. **Explore Data**: In this section, you can explore the data interactively using boxplots and scatter plots. It allows you to visualize how specific variables impact property prices.
3. **Make Predictions**: On the "Make Predictions" page, you can input certain variables and generate your own predictions based on the latest trained model.
It's worth noting that we maintain the app's accuracy by regularly updating the data through GitHub Actions, which scrapes and retrains the model every month. To test your skills against my base test RMSE score, you can download and use the [dataset](https://www.kaggle.com/datasets/unworried1686/belgian-property-prices-2023/data) I uploaded to my Kaggle account through Kaggle Datasets.
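The monthly refresh can be wired up with a scheduled workflow along these lines. This is a minimal sketch, not the project's actual workflow file; the file name, action versions, and script paths are assumptions for illustration:

```yaml
# .github/workflows/monthly-refresh.yml (hypothetical layout)
name: monthly-refresh
on:
  schedule:
    - cron: "0 3 1 * *"  # 03:00 UTC on the 1st of every month
jobs:
  scrape-and-retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt
      - run: python src/main.py  # scrape the listings, then retrain the model
```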
I'm excited to see what you can come up with using this tool. Feel free to explore and experiment with the app, and don't hesitate to ask if you have any questions or need assistance with anything related to it.
In a series of blog posts, we will guide you through the thought process that led to the creation of the Streamlit application. Feel free to explore the topics that pique your interest or that you'd like to learn more about. We hope you'll find this information valuable for your own projects. Let's get started!
Note: Although the data collection process is not described in detail here, you can find the complete workflow in the `src/main.py` file, specifically the relevant functions and methods in `src/data/make_dataset.py`. Feel free to explore it further. In summary, we utilized the `requests-html` library to scrape all available data, which we will show you how to process in subsequent notebooks.
::: {.callout-tip title="How to import your own module using a .pth file"}
In case you encounter difficulties importing your own modules, I found this [Stack Overflow question](https://stackoverflow.com/questions/700375/how-to-add-a-python-import-path-using-a-pth-file) to be quite helpful. To resolve this issue, you can follow these steps:
1. Create a `.pth` file that contains the path to the folder where your module is located. For example, prepare a `.pth` file with the content: `C:\Users\myname\house_price_prediction\src`.
2. Place this `.pth` file into the following folder: `C:\Users\myname\AppData\Roaming\Python\Python310\site-packages`. This folder is already included in your `PYTHONPATH`, allowing Python to recognize your package directory.
3. To verify what folders are in your `PYTHONPATH`, you can check it using the `import sys` and `sys.path` commands.
Once you've completed these steps, you'll be able to import the `utils` module with the following statement: `from data import utils`.
:::
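As a quick sanity check for step 3 above, a couple of lines are enough to list every directory Python will search; the folder registered via the `.pth` file should appear among them:

```python
import sys

# Each entry is a directory Python scans when resolving imports;
# the path added through the .pth file should be in this list.
for entry in sys.path:
    print(entry)
```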
# Import data
```{python}
import time
from pathlib import Path
import pandas as pd
from data import utils
from lets_plot import *
from lets_plot.mapping import as_discrete
LetsPlot.setup_html()
```
# Select Columns to Retain Based on the Quantity of Missing Values
In the realm of web scraping, managing the sheer volume of data is often the first hurdle. It's not so much about deciding what data to collect but rather what data to retain. As we delve into the data collected from the Immoweb website, we are met with a plethora of listings, each offering a unique set of information.
For many of these listings there are commonalities: details like location and price appear in nearly every ad. Interspersed among them, however, are one-of-a-kind nuggets of information, such as the number of swimming pools, which are unique to certain listings. While these specific details can be vital when assessing the value of an individual property, the downside is that they lead to a sparse dataset.
Now, let's import our initial dataset to examine the features that are commonly shared among most ads, i.e., those that are filled in most frequently. After identifying these common attributes, we can optimize our data collection process by keeping these key characteristics and removing the less common ones.
```{python}
# Load the raw scraped snapshot from the configured raw-data folder
df = pd.read_parquet(
utils.Configuration.RAW_DATA_PATH.joinpath(
"complete_dataset_2023-09-27_for_NB2.parquet.gzip"
)
)
```
As depicted in @fig-fig1, the features 'day of retrieval,' 'url,' and 'Energy Class' demonstrate the highest completeness, with more than 90% of instances being present. In contrast, 'dining room,' 'office,' and 'TV cable' are among the least populated features, with roughly 10-20% of non-missing instances.
This information allows us to devise a strategy where we, for example, could retain features with a completeness of over 50%. We will delve deeper into this question in our subsequent notebooks.
```{python}
#| fig-cap: Top 50 Features with Non-Missing Values Above 50%
#| label: fig-fig1
# Percentage of non-missing values per column, for the 50 most complete columns
lowest_missing_value_columns = (
df.notna()
.sum()
.div(df.shape[0])
.mul(100)
.sort_values(ascending=False)
.head(50)
.round(1)
)
indexes_to_keep = lowest_missing_value_columns.index  # columns to retain downstream
(
lowest_missing_value_columns.reset_index()
.rename(columns={"index": "column", 0: "perc_values_present"})
    .assign(
        Has_non_missing_values_above_50_pct=lambda df: df.perc_values_present.gt(50),
        # Shift values by 50 so bars extend left/right of the 50% threshold
        perc_values_present=lambda df: df.perc_values_present - 50,
    )
.pipe(
lambda df: ggplot(
df,
aes(
"perc_values_present",
"column",
fill="Has_non_missing_values_above_50_pct",
),
)
+ geom_bar(stat="identity", orientation="y", show_legend=False)
+ ggsize(800, 1000)
+ labs(
title="Top 50 Features with Non-Missing Values Above 50%",
subtitle="""The plot illustrates that the features 'day of retrieval,' 'url,' and 'Energy Class' exhibited the
        highest completeness, with over 90% of instances present. Conversely, 'dining room,' 'office,' and 'TV cable'
were among the least populated features, with approximately 10-20% of non-missing instances.
""",
x="Percentage of Instances Present with Reference Point at 50%",
y="",
caption="https://www.immoweb.be/",
)
+ theme(
plot_subtitle=element_text(
size=12, face="italic"
), # Customize subtitle appearance
plot_title=element_text(size=15, face="bold"), # Customize title appearance
)
)
)
```
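To make the 50% retention rule concrete, here is a minimal, self-contained sketch of the same completeness computation on a made-up frame (the column names and values are invented for illustration):

```python
import pandas as pd

# Toy listings: 'price' is always filled in, 'swimming_pool' rarely is
toy = pd.DataFrame(
    {
        "price": [250_000, 300_000, 199_000, 420_000],
        "location": ["Gent", "Leuven", None, "Antwerpen"],
        "swimming_pool": [None, None, 1, None],
    }
)

# Percentage of non-missing values per column, as in the chain above
completeness = toy.notna().mean().mul(100)

# Retain only columns that are at least 50% complete
kept = completeness[completeness.ge(50)].index.tolist()
print(kept)  # ['price', 'location']
```

Applied to the real dataset, the same filter would drop sparse columns like 'swimming pool' while keeping near-universal ones like price and location.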
That's all for now. In part 2, we will examine the downloaded raw data and investigate the error messages we encountered during the web scraping process with the goal of understanding how to overcome these challenges. See you in the next installment!